KMID : 1022420210130040047
Phonetics and Speech Sciences
2021, Vol. 13, No. 4, pp. 47-53
End-to-end non-autoregressive fast text-to-speech
Kim Wi-Back

Nam Ho-Sung
Abstract
Autoregressive text-to-speech (TTS) models suffer from unstable and slow inference. Inference instability occurs when a poorly predicted sample at time step t affects all subsequent predictions. Slow inference arises because the model structure requires the predicted samples from time steps 1 to t-1 before it can predict the sample at time step t, so samples must be generated one after another.
In this study, an end-to-end non-autoregressive fast text-to-speech model is proposed as a solution to these problems. The results show that the model's Mean Opinion Score (MOS) is close to that of Tacotron 2 - WaveNet, while its inference is faster and more stable than that of Tacotron 2 - WaveNet. This study also aims to offer insight into improving non-autoregressive models.
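The difference between the two decoding schemes described in the abstract can be illustrated with a toy sketch. This is not the model from the paper: the functions `autoregressive_decode`, `non_autoregressive_decode`, and `affected_steps`, the arithmetic used as a stand-in "model", and the `corrupt_at` parameter are all hypothetical and chosen only to show how one bad sample propagates in an autoregressive loop but stays local when every step is predicted independently.

```python
from typing import List, Optional


def autoregressive_decode(cond: List[float],
                          corrupt_at: Optional[int] = None) -> List[float]:
    """Toy autoregressive decoder: step t needs the output of step t-1."""
    out: List[float] = []
    prev = 0.0
    for t, c in enumerate(cond):       # strictly sequential loop
        y = 0.5 * prev + c             # stand-in model: previous sample + conditioning
        if t == corrupt_at:
            y += 1.0                   # simulate one poorly predicted sample
        out.append(y)
        prev = y                       # the (possibly bad) sample feeds the next step
    return out


def non_autoregressive_decode(cond: List[float],
                              corrupt_at: Optional[int] = None) -> List[float]:
    """Toy non-autoregressive decoder: each step uses only its own conditioning."""
    out = [2.0 * c for c in cond]      # no dependency between steps, so parallelizable
    if corrupt_at is not None:
        out[corrupt_at] += 1.0         # the same simulated error, at one step only
    return out


def affected_steps(clean: List[float], noisy: List[float]) -> List[int]:
    """Indices where the corrupted run differs from the clean run."""
    return [t for t, (a, b) in enumerate(zip(clean, noisy)) if abs(a - b) > 1e-12]


if __name__ == "__main__":
    cond = [1.0] * 6
    ar = affected_steps(autoregressive_decode(cond),
                        autoregressive_decode(cond, corrupt_at=2))
    nar = affected_steps(non_autoregressive_decode(cond),
                         non_autoregressive_decode(cond, corrupt_at=2))
    print("autoregressive steps affected:", ar)
    print("non-autoregressive steps affected:", nar)
```

In this sketch the error injected at one step changes every later output of the autoregressive loop, whereas the non-autoregressive version changes only the step where the error occurred. The loop in the first function also cannot be parallelized over time steps, which is the source of the slow inference the abstract describes.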
Keywords
deep learning, neural network, speech synthesis, Text-to-Speech (TTS)
Listed journal information
Korea Research Foundation (KCI)